Abstract

We compare the performance of a recurrent neural network with 
the best results published so far on phoneme recognition in the TIMIT 
database. These published results have been obtained with a combina- 
tion of classifiers. However, in this paper we apply a single recurrent 
neural network to the same task. Our recurrent neural network attains 
an error rate of 24.6%. This result is not significantly different from 
that obtained by the other best methods, but they rely on a combina- 
tion of classifiers for achieving comparable performance. 

1 Introduction 

Spontaneous speech production is a continuous and dynamic process. This 
continuity is reflected in the acoustics of speech sounds and, in particular, 
in the transitions from one speech sound to another. As a consequence, the 
boundaries between speech sounds are not clearly defined. This fact sig- 
nificantly contributes to making segmentation and labelling of speech data 
interrelated tasks. Because of this interrelation, automatic speech recogni- 
tion is best performed with methods such as hidden Markov models (HMM) 
that do not require segmented data for development. On the contrary, de- 
veloping neural networks has traditionally relied on segmented data. The 
objective functions require a network output target value at every or spe- 
cific time-steps in the data sequence. Connectionist temporal classification 
(CTC) overcomes this limitation. CTC allows developing neural network 
classifiers using a sequence of labels as the desired output target [5]. Labels
correspond to events occurring in the input data sequence, such as phones
in a speech data stream. The number of labels in a target labelling is, there- 
fore, typically much shorter than the number of time-steps in the input data 
sequence. Also, there is no timing information in a target labelling, except
for labels being in the same order in which events occur in the input data 
sequence. 

Recurrent neural networks are an interesting alternative to HMMs for 
speech recognition. Their continuous internal state is naturally well suited 
for modelling speech dynamics. Moreover, their capability to model data 
dependencies has potential for modelling coarticulatory effects in speech. In 
contrast, HMMs are built on a number of independence assumptions about 
the data. 

We showed in [5] that CTC-based recurrent neural networks outperform
state-of-the-art algorithms on phoneme recognition in the TIMIT database. 
In contrast with the algorithms compared in [5], which rely on a single type
of classifier to perform the task, Glass's system uses a committee-based classifier
whereas Deng et al.'s combines the scores from two related algorithms [1].
These two systems achieved the best phoneme recognition rates published 
so far for TIMIT. In this paper, we compare the performance of a single 
CTC-based recurrent neural network with that of Glass's and Deng et al.'s
systems. The main differences with respect to the experimental setup used 
in [5] are: first, the data are divided into training, validation and test sets as 
described in [8], and second, a standard set of 39 phonetic categories, instead 
of 61, is used [10]. This new experimental setup allows a direct comparison
of the three systems. 

2 Materials 

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) 
contains recordings of prompted English speech accompanied by manually 
segmented phonetic transcripts [2]. TIMIT contains a total of 6300 sentences,
10 sentences spoken by each of 630 speakers from 8 major dialect
regions of the United States. 

For the experiments, the SA sentences were discarded and the remaining 
data were split into a training set, a validation set and a test set according 
to [8]. The training set contains 3696 sentences (462 speakers), the valida- 
tion set contains 400 sentences (50 speakers) and the test set contains 192 
sentences (24 speakers). 

TIMIT transcriptions are based on 61 phones. Typically, 48 phones are
selected for modelling. Confusions among a number of these 48 phones
are not counted as errors. Therefore, results are presented for 39 phonetic
categories. We decided to train the network on transcriptions based on this
lexicon of 39 phones. The 61 categories were folded onto 39 categories as
described by Lee and Hon [10]. This is shown in table 1.

    Category    Folded phones
    aa          aa, ao
    ah          ah, ax, ax-h
    er          er, axr
    hh          hh, hv
    ih          ih, ix
    l           l, el
    m           m, em
    n           n, en, nx
    ng          ng, eng
    sh          sh, zh
    sil         pcl, tcl, kcl, bcl, dcl, gcl, h#, pau, epi
    uw          uw, ux
    -           q

Table 1: Folding the 61 categories in TIMIT onto 39 categories (from [10]).
The phones in the right column are folded onto their corresponding category
in the left column (the phone 'q' is discarded). All other TIMIT phones are
left intact.
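The folding in table 1 amounts to a lookup over phone labels. The following is a minimal sketch of that lookup; the names FOLD and fold_phones are illustrative and do not come from the original work.

```python
# Folding of TIMIT phones onto the 39-category set, following table 1.
# FOLD and fold_phones are hypothetical helper names used for illustration.
FOLD = {
    "ao": "aa",
    "ax": "ah", "ax-h": "ah",
    "axr": "er",
    "hv": "hh",
    "ix": "ih",
    "el": "l",
    "em": "m",
    "en": "n", "nx": "n",
    "eng": "ng",
    "zh": "sh",
    "ux": "uw",
}
# Closures and silences all fold onto the single category "sil".
for _phone in ("pcl", "tcl", "kcl", "bcl", "dcl", "gcl", "h#", "pau", "epi"):
    FOLD[_phone] = "sil"


def fold_phones(phones):
    """Map a TIMIT phone sequence onto the 39 categories; 'q' is discarded."""
    return [FOLD.get(p, p) for p in phones if p != "q"]
```

Phones that do not appear in the right column of table 1 pass through unchanged, which is how the remaining TIMIT phones are left intact.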

Speech data was transformed into Mel frequency cepstral coefficients 
(MFCC) with the HTK software package [11]. Spectral analysis was carried
out with a 40 channel Mel filter bank from 64 Hz to 8 kHz. A pre-emphasis 
coefficient of 0.97 was used to correct spectral tilt. Twelve MFCC plus the 
0th order coefficient were computed on Hamming windows 25 ms long, every 
10 ms. Delta and Acceleration coefficients were added giving a vector of 39 
coefficients in total. For the network, the coefficients were normalised to 
have mean zero and standard deviation one over the training set. 
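The normalisation step can be sketched as follows. This is an illustration only: it assumes the statistics are computed per coefficient and that the population standard deviation (dividing by the number of frames) is used, neither of which is stated in the text. The function names are hypothetical.

```python
import math

# 12 MFCCs plus the 0th coefficient, each with delta and acceleration terms.
FEATURE_DIM = (12 + 1) * 3


def fit_normaliser(train_frames):
    """Per-coefficient mean and standard deviation over all training frames.

    Assumes every coefficient varies over the training set (non-zero deviation).
    """
    n = len(train_frames)
    dim = len(train_frames[0])
    mean = [sum(f[d] for f in train_frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in train_frames) / n)
        for d in range(dim)
    ]
    return mean, std


def normalise(frames, mean, std):
    """Apply training-set statistics to any set of frames."""
    return [[(x - m) / s for x, m, s in zip(f, mean, std)] for f in frames]
```

The statistics are fitted on the training set only and then applied unchanged to validation and test frames.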

The division of TIMIT into the three aforementioned data sets and the
presentation of results for 39 phones were also adopted in [4, 1]. As acoustic
features, Deng et al. used frequency-warped LPC cepstra [1], instead of
MFCC. For his part, Glass tried a number of variations and combinations of 
MFCC, perceptual linear prediction (PLP) cepstral coefficients, energy and 
duration [8, 4]. Glass's system built 61 models, one for each of the 61 phones
in TIMIT, and results were tabulated using the standard set of 39 phones. 



3 Method 



The method employed is the same as that described in [5]. Briefly, phoneme
recognition is performed with a recurrent neural network. The long short-term
memory recurrent neural network (LSTM) was used because of its ability
to bridge long time delays [9, 3]. The hidden units in an LSTM network are
called memory blocks. Each memory block has one or more memory cells 
controlled by an input, an output and a forget gate. When the input gate 
is open incoming data is stored in the memory cell, and when the output 
gate is open data stored in the memory cell is sent to the output layer. 
The forget gate resets the memory cell. Gates can optionally have access 
to the data stored in the memory cell (peephole connections). Gates and
the memory block input are typically connected to the same units in the 
network. These connections are trainable, thus the behaviour of the gates 
is not pre-determined, but rather learned during training. 
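The gating described above can be illustrated with one time-step of a single-cell memory block. This is a sketch of a common LSTM formulation with peephole connections, not necessarily the exact equations used in the experiments; the function and argument names are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def memory_block_step(x, c_prev, w, peep):
    """One time-step of a memory block with a single cell.

    x      : incoming activations (inputs, recurrent outputs and a constant
             1.0 for the bias), shared by the gates and the block input
    c_prev : cell state at the previous time-step
    w      : weight vectors under the keys 'in', 'forget', 'out', 'cell'
    peep   : peephole weights under the keys 'in', 'forget', 'out'
    """
    i = sigmoid(dot(w["in"], x) + peep["in"] * c_prev)          # input gate
    f = sigmoid(dot(w["forget"], x) + peep["forget"] * c_prev)  # forget gate
    c = f * c_prev + i * math.tanh(dot(w["cell"], x))           # new cell state
    o = sigmoid(dot(w["out"], x) + peep["out"] * c)             # output gate
    return o * math.tanh(c), c
```

With the forget gate near zero the previous cell state is discarded, and with the input gate near zero the incoming data is ignored, which matches the reset and store behaviour described in the text.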

For phoneme recognition, where both anticipatory and carry-over coar- 
ticulatory effects are important, a bi-directional neural network is suitable. 
The bi-directional LSTM (BLSTM) [7, 6] has two separate recurrent hidden
layers, both of them connected to the same input and output layers.
The forward recurrent network is presented with sequential data forward in 
time, from the beginning of the data sequence to time-step t. The backward 
recurrent network is presented with sequential data backwards in time, from 
the end of the data sequence to time-step t. At any time-step t, the network 
has access to all information in the data sequence. 
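The two passes can be sketched with a generic recurrence. The example below is a toy illustration rather than the BLSTM itself: step_fwd and step_bwd stand in for the forward and backward hidden layers, and all names are hypothetical.

```python
def bidirectional(xs, step_fwd, step_bwd, h0=0.0):
    """Run one recurrence forwards and one backwards over the sequence xs.

    Returns, for every time-step t, the pair (forward state, backward state).
    The forward state at t summarises xs[0..t]; the backward state at t
    summarises xs[t..end].
    """
    fwd, h = [], h0
    for x in xs:
        h = step_fwd(h, x)
        fwd.append(h)

    bwd, h = [None] * len(xs), h0
    for t in range(len(xs) - 1, -1, -1):
        h = step_bwd(h, xs[t])
        bwd[t] = h

    return list(zip(fwd, bwd))
```

Using a running sum as the recurrence, `bidirectional([1, 2, 3], lambda h, x: h + x, lambda h, x: h + x)` pairs the sum of everything up to t with the sum of everything from t onwards, so each time-step carries information about the whole sequence.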

The BLSTM recurrent neural network was trained with the CTC algorithm
using the list of phones in the speech utterances as target labellings [5].
Once the network has been trained, the predicted labelling for a new speech 
utterance can be directly read from its outputs. This method (best path 
decoding) is, however, not guaranteed to find the most probable labelling. 
A second method (prefix search decoding) consists in calculating the proba- 
bilities of successive extensions of labelling prefixes, which can then be used 
to find the most probable labelling. However, because this procedure is com- 
putationally intensive, it was separately calculated for sections of the output 
sequence. As a consequence, prefix search decoding is not guaranteed to find 
the most probable labelling but, in practice, it generally outperforms best 
path decoding [5].
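Best path decoding can be sketched as follows, assuming the usual CTC convention that the most active output is taken at every time-step, repeated labels are merged and blanks are then removed. The function name and the representation of the outputs as lists of probabilities are assumptions made for the example.

```python
def best_path_decode(outputs, blank):
    """Read a labelling directly from the network outputs.

    outputs : one list of output activations per time-step
    blank   : index of the blank output unit
    """
    # Most active output unit at every time-step.
    path = [max(range(len(p)), key=p.__getitem__) for p in outputs]

    # Merge repeated labels, then drop blanks.
    labels, prev = [], None
    for k in path:
        if k != prev and k != blank:
            labels.append(k)
        prev = k
    return labels
```

A blank between two occurrences of the same label keeps them separate, so the path 0 0 blank 0 1 1 blank decodes to the labelling 0 0 1.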

In the experiments reported in this paper, the BLSTM-CTC network 
had an input layer of size 39, the forward and backward hidden layers had 
128 blocks each, and the output layer was size 40 (39 phones plus blank). 
The gates used a logistic sigmoid function in the range [0,1]. The input 
layer was fully connected to the hidden layer and the hidden layer was fully
connected to itself and the output layer. The total number of weights was 
183,080. 
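The weight count can be checked with a short calculation. The sketch below assumes one memory cell per block, peephole connections on all three gates, and a bias weight on every gate, block input and output unit; these details are not stated explicitly in the text, but under them the count reproduces the figure above.

```python
def lstm_layer_weights(n_in, n_blocks, peepholes=True):
    """Weights in one recurrent LSTM layer (assumed: one cell per block)."""
    # Each block has four weighted units (block input plus three gates),
    # each fed by the inputs, the layer's own outputs and a bias.
    per_unit = n_in + n_blocks + 1
    total = 4 * n_blocks * per_unit
    if peepholes:
        total += 3 * n_blocks  # one peephole weight per gate
    return total


def blstm_ctc_weights(n_in=39, n_blocks=128, n_out=40):
    """Total weights for forward and backward layers plus the output layer."""
    hidden = 2 * lstm_layer_weights(n_in, n_blocks)
    output = n_out * (2 * n_blocks + 1)  # both hidden layers plus a bias
    return hidden + output
```

Each hidden layer contributes 86,400 weights and the output layer 10,280, giving 183,080 in total.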

Training of the BLSTM-CTC network was done by gradient descent with 
weight updates after every training example. In all cases, the learning rate 
was 10^-4, momentum was 0.9, weights were initialised randomly in the range
[-0.1, 0.1] and, during training, Gaussian noise with a standard deviation
of 0.6 was added to the inputs to improve generalisation. For prefix search 
decoding, an activation threshold of 0.9999 was used (see [5] for a description
of this parameter). 
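The training ingredients named above can be sketched in a few lines. This is an illustration of a standard momentum update, uniform initialisation and input noise, not the authors' code; the gradient computation itself is omitted, the learning rate is left as an explicit argument, and all function names are hypothetical.

```python
import random


def init_weights(n, rng=random):
    """Uniform random initialisation in the range [-0.1, 0.1]."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def add_input_noise(frame, std=0.6, rng=random):
    """Gaussian noise added to the inputs during training only."""
    return [x + rng.gauss(0.0, std) for x in frame]


def momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """In-place gradient descent update with momentum.

    Called once per training example, since weights are updated after
    every example.
    """
    for k in range(len(weights)):
        velocity[k] = momentum * velocity[k] - lr * grads[k]
        weights[k] += velocity[k]
```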

Performance was measured as the normalised edit distance (label error 
rate; LER) between the target label sequence and the output label sequence 
given by the system. 
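A label error rate of this kind can be computed with the standard edit distance. In the sketch below the total number of substitutions, insertions and deletions is divided by the total number of target labels; this choice of normalisation is an assumption, as are the function names.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def label_error_rate(pairs):
    """pairs: list of (target labels, output labels) for each utterance."""
    errors = sum(edit_distance(target, output) for target, output in pairs)
    length = sum(len(target) for target, _ in pairs)
    return errors / length
```

For example, a target of three labels with one deletion and a second, correctly recognised target of one label give an error rate of 1/4.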

Deng et al.'s hidden trajectory models (HTM) are a type of probabilistic
generative model aimed at modelling speech dynamics and adding
long-contextual-span capabilities that are missing in hidden Markov models
(HMM) [1]. A thorough description of this system is available in [12]. HTM
uses a bi-directional filter to estimate probabilistic speech data trajectories 
given a hypothesized phone sequence. This estimate is then used to compute 
the model likelihood score for the observed speech data. The search for the 
phone sequence with the highest likelihood is performed with an A* based 
lattice search and rescoring algorithm specifically developed for HTM. 

Glass's system is a segment-based speech recogniser (as opposed to frame- 
based recognisers) based on the detection of landmarks in the speech sig- 
nal [1]. Acoustic features are computed over hypothesized segments and 
at their boundaries. The standard decoding framework is modified and ex- 
tended to deal with this paradigm shift. 

4 Results 

Results are shown in table 2. Error rates include errors due to substitutions,
insertions and deletions with respect to the reference transcription.
Deng et al.'s best result was achieved with a lattice-constrained A* search
with weighted HTM, HMM, and language model scores [1]. Glass's best results
were achieved with many heterogeneous information sources and classifier
combinations [4]. A single BLSTM-CTC recurrent neural network attains
an error rate of 24.6%, which is not significantly different from Deng et
al.'s or Glass's best results. It is likely that BLSTM-CTC can achieve improved
performance when more sources of information are added and when
it is combined with other classifiers. The results shown in table 2 are
the best results reported in the literature on phoneme recognition in TIMIT.

    System                                    Error rate
    Deng et al.'s baseline HMM [1]            28.57%
    BLSTM-CTC (best path decoding)            25.17%, s.e. 0.20%
    Deng et al.'s HTM-HMM [1]                 24.93%
    BLSTM-CTC (prefix search decoding)        24.58%, s.e. 0.20%
    Glass's committee-based classifier [4]    24.4%

Table 2: Error rates on TIMIT. Results for BLSTM-CTC are the average
and standard error (s.e.) over 10 runs. On average, the networks were
trained for 112.5 epochs (s.e. = 6.4). The horizontal lines divide the list
of systems into groups performing significantly differently from the networks.
BLSTM-CTC with best path decoding is significantly different from Deng et
al.'s baseline HMM (two-sided t-test, p<3- 10"^), from BLSTM-CTC with
prefix search decoding (p < 0.05) and from Glass's classifier (p < 0.004).
BLSTM-CTC with prefix search decoding is not significantly different from
either Deng et al.'s HTM-HMM or Glass's classifier.
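The comparisons against published figures in the caption of table 2 can be approximated from the reported means and standard errors. The sketch below assumes a one-sample, two-sided t-test of the 10 network runs against each published error rate; the text only states that a two-sided t-test was used, so this is an interpretation. The critical value 2.262 is the two-sided 5% point of Student's t distribution with 9 degrees of freedom.

```python
# Two-sided 5% critical value of Student's t with 10 - 1 = 9 degrees of freedom.
T_CRIT_DF9 = 2.262


def t_statistic(mean, se, reference):
    """One-sample t statistic of a run average against a published figure."""
    return (mean - reference) / se


def differs(mean, se, reference, t_crit=T_CRIT_DF9):
    """True if the average differs from the reference at the 5% level."""
    return abs(t_statistic(mean, se, reference)) > t_crit
```

Under this assumption, prefix search decoding (24.58%, s.e. 0.20%) gives t statistics of about -1.75 against 24.93% and 0.9 against 24.4%, neither of which exceeds the critical value, while best path decoding (25.17%) gives about 3.85 against 24.4%.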

5 Conclusions 

We have provided results for phoneme recognition with BLSTM-CTC using 
the TIMIT database. The experiments use the same standard data sets and 
phonetic inventory employed by the systems reportedly having the best performance
to date. Finally, we have compared BLSTM-CTC's performance
to that achieved by these systems [4, 1]. BLSTM-CTC achieves comparable
performance without relying on a combination of multiple classifiers. Also, 
BLSTM-CTC makes fewer assumptions about the task domain. 